01_water_quality_data

library(dplyr)
library(readxl)
library(lubridate)

library(ggplot2)

library(sf)
library(leaflet)

Read in raw data

The raw data is from Water Quality Data in NYC.

water_quality <- read.csv("../../data/raw_data/Drinking_Water_Quality_Distribution_Monitoring_Data_20250416.csv")
sample_site <- read_excel("../../data/raw_data/OpenData_Distribution_Water_Quality_Sampling_Sites_Updated_2021-0618.xlsx", sheet = 1)

EDA and Data Filtering

All variables:

names(water_quality)
##  [1] "Sample.Number"                       "Sample.Date"                        
##  [3] "Sample.Time"                         "Sample.Site"                        
##  [5] "Sample.class"                        "Residual.Free.Chlorine..mg.L."      
##  [7] "Turbidity..NTU."                     "Fluoride..mg.L."                    
##  [9] "Coliform..Quanti.Tray...MPN..100mL." "E.coli.Quanti.Tray...MPN.100mL."

The water quality data contain variables as below (Since we don’t have detailed explanation from the website, all explanation is based on my own understanding):

Sample level:

  • Sample Number: A unique identifier assigned to each water sample.
  • Sample Date: The date when the water sample was collected. It is usually formatted as MM/DD/YYYY.
  • Sample Time: The time when the water sample was taken (timestamp format (ISO 8601)), providing precise timing. Here we don’t need such precise timing.
  • Sample Site: The identifier that indicates the location where the sample was collected. We can map this site to sample_site to get the accurate coordinate.
  • Sample Class: The classification of the sample. For example, “Compliance” indicates that these samples are collected as part of the regulatory monitoring program, while “Operational” help operators manage and maintain system performance, optimize treatment processes, and monitor system conditions. Here we don’t need such precise classification.

Water quality measurement:

  • Residual Free Chlorine (mg/L): Free chlorine is used to maintain a disinfectant residual throughout the distribution system.
  • Turbidity (NTU): Turbidity is a measure of water clarity. It indicates the amount of suspended particles in the water, which can impact both the appearance and the safety of the water.
  • Fluoride (mg/L): Fluoride is often added to public water supplies to help prevent tooth decay.
  • Coliform (Quanti-Tray) (MPN/100mL): Coliform bacteria are used as indicators of general water quality and sanitary conditions.
  • E.coli (Quanti-Tray) (MPN/100mL): Similar to the coliform count. E. coli is a specific fecal indicator; its presence typically signals fecal contamination, which is a significant health concern.
names(water_quality)[6:10] <- c("Residual_Chlorine", "Turbidity", "Fluoride", "Coliform", "Ecoli")

sample date

water_quality <- water_quality %>% 
  mutate(date = as.Date(Sample.Date, format = "%m/%d/%Y")) %>% 
  mutate(year_month = format(date, "%Y-%m")) %>%
  mutate(year = format(date, "%Y")) %>%
  mutate(month = format(date, "%m")) 
water_quality %>%
  group_by(year, month) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  ggplot(aes(x = month, y = year, fill = count)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Heatmap of Year-Month Combination Count", x = "Month", y = "Year") +
  theme_minimal()

We can find almost all the time we have the measurement, making it possible to do a time scale analysis.

sample site

We first do the deduplication of the reference sample site file. (Sample Site 39550)

names(sample_site)[c(1,3,4)] = c("Sample.Site","X","Y")
sample_site <- sample_site %>% 
  distinct(Sample.Site, .keep_all = TRUE)

We first map the provided coordinates to the data and remove the NA values.

water_quality <- water_quality %>%
  left_join(sample_site %>% 
              select("Sample.Site", "X", "Y"), 
            by = "Sample.Site")
water_quality <- water_quality %>%
  filter(!is.na(X) & !is.na(Y))

The provided sample coordinate seems to be encoded in EPSG:2263 format. We first try to change this into the latitude and longitude based encoding WGS84(EPSG:4326) to make our visualization easier.

water_quality <- st_as_sf(water_quality, coords = c("X", "Y"), crs = 2263)
water_quality$geometry <- st_transform(water_quality$geometry, 4326)

From the changed geometry, we can find they are roughly consistent with the latitude and longitude of NYC, so we would expect the transformation is working.

water_quality$geometry
## Geometry set for 151215 features 
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -74.24467 ymin: 40.50231 xmax: -73.71687 ymax: 40.90798
## Geodetic CRS:  WGS 84
## First 5 geometries:

We can try to visualize it on the map. Here we get the NYC neighborhood data from official website. The visualization also shows the coordinate mapping is satisfactory.

neighborhoods <- st_read("../../data/raw_data/nynta2020_25a/nynta2020.shp") %>% st_transform(4326)
## Reading layer `nynta2020' from data source 
##   `/Users/dl3738/Documents/Work/Course/Data Visualization/Group_F_TBD/data/raw_data/nynta2020_25a/nynta2020.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 262 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120128.4 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
neighborhoods
## Simple feature collection with 262 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
## Geodetic CRS:  WGS 84
## First 10 features:
##    BoroCode BoroName CountyFIPS NTA2020                             NTAName
## 1         3 Brooklyn        047  BK0101                          Greenpoint
## 2         3 Brooklyn        047  BK0102                        Williamsburg
## 3         3 Brooklyn        047  BK0103                  South Williamsburg
## 4         3 Brooklyn        047  BK0104                   East Williamsburg
## 5         3 Brooklyn        047  BK0201                    Brooklyn Heights
## 6         3 Brooklyn        047  BK0202 Downtown Brooklyn-DUMBO-Boerum Hill
## 7         3 Brooklyn        047  BK0203                         Fort Greene
## 8         3 Brooklyn        047  BK0204                        Clinton Hill
## 9         3 Brooklyn        047  BK0261                  Brooklyn Navy Yard
## 10        3 Brooklyn        047  BK0301           Bedford-Stuyvesant (West)
##    NTAAbbrev NTAType CDTA2020
## 1      Grnpt       0     BK01
## 2   Wllmsbrg       0     BK01
## 3  SWllmsbrg       0     BK01
## 4  EWllmsbrg       0     BK01
## 5      BkHts       0     BK02
## 6   DwntwnBk       0     BK02
## 7      FtGrn       0     BK02
## 8    ClntnHl       0     BK02
## 9   BkNvyYrd       6     BK02
## 10   BdSty_W       0     BK03
##                                                   CDTAName Shape_Leng
## 1           BK01 Williamsburg-Greenpoint (CD 1 Equivalent)   28919.56
## 2           BK01 Williamsburg-Greenpoint (CD 1 Equivalent)   28098.03
## 3           BK01 Williamsburg-Greenpoint (CD 1 Equivalent)   18250.28
## 4           BK01 Williamsburg-Greenpoint (CD 1 Equivalent)   43184.80
## 5  BK02 Downtown Brooklyn-Fort Greene (CD 2 Approximation)   14312.51
## 6  BK02 Downtown Brooklyn-Fort Greene (CD 2 Approximation)   30598.67
## 7  BK02 Downtown Brooklyn-Fort Greene (CD 2 Approximation)   23284.62
## 8  BK02 Downtown Brooklyn-Fort Greene (CD 2 Approximation)   18102.97
## 9  BK02 Downtown Brooklyn-Fort Greene (CD 2 Approximation)   39415.64
## 10            BK03 Bedford-Stuyvesant (CD 3 Approximation)   26307.65
##    Shape_Area                       geometry
## 1    35321810 MULTIPOLYGON (((-73.93213 4...
## 2    28854314 MULTIPOLYGON (((-73.95814 4...
## 3    15208961 MULTIPOLYGON (((-73.95024 4...
## 4    52267408 MULTIPOLYGON (((-73.92406 4...
## 5     9982321 MULTIPOLYGON (((-73.99236 4...
## 6    23732978 MULTIPOLYGON (((-73.97906 4...
## 7    17534134 MULTIPOLYGON (((-73.96284 4...
## 8    14566585 MULTIPOLYGON (((-73.96135 4...
## 9    10106807 MULTIPOLYGON (((-73.97893 4...
## 10   36906699 MULTIPOLYGON (((-73.94414 4...
leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data = neighborhoods, color = "#444", weight = 1, label = ~NTAName) %>%
  addCircleMarkers(data = water_quality$geometry, radius = 1, color = "red")

Chlorine

summary(water_quality$Residual_Chlorine)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -9.9900  0.4200  0.5700  0.5624  0.7100  2.2000      49
water_quality <- water_quality %>%
  filter(!is.na(Residual_Chlorine) & Residual_Chlorine >=0)
water_quality %>%
  ggplot(aes(x=Residual_Chlorine)) +
  geom_histogram()

Turbidity

Extreme value: <0.10

Here for easy analysis, we directly change these values to 0.10.

water_quality <- water_quality %>%
  mutate(Turbidity=gsub("[<>]", "", Turbidity)) %>%
  mutate(Turbidity=as.numeric(Turbidity)) %>%
  filter(!is.na(Turbidity) & Turbidity >= 0)
water_quality %>%
  filter(Turbidity > 1.5) %>%
  nrow()
## [1] 332
water_quality %>%
  filter(Turbidity > 1.5) %>%
  nrow() / nrow(water_quality)
## [1] 0.002196334
water_quality %>%
  filter(Turbidity <= 1.5) %>%
  ggplot(aes(x=Turbidity)) +
  geom_histogram(binwidth = 0.1)

Fluoride

nrow(water_quality %>% filter(!is.na(Fluoride) & Fluoride != ""))
## [1] 19961

Since the NA value for Fluoride is too much and this variable is not very correlated with water quality, we can just discard it.

Coliform

Extreme value: <1, >200.5

water_quality <- water_quality %>%
  mutate(Coliform=gsub("[<>]", "", Coliform)) %>%
  mutate(Coliform=as.numeric(Coliform)) %>%
  filter(!is.na(Coliform) & Coliform >= 0)
water_quality %>%
  filter(Coliform != 1) %>%
  nrow()
## [1] 416
water_quality %>%
  filter(Coliform != 1) %>%
  nrow() / nrow(water_quality)
## [1] 0.002753253
water_quality %>%
  ggplot(aes(x=Coliform)) +
  geom_histogram()

Ecoli

Extreme value: <1

water_quality <- water_quality %>%
  mutate(Ecoli=gsub("[<>]", "", Ecoli)) %>%
  mutate(Ecoli=as.numeric(Ecoli)) %>%
  filter(!is.na(Ecoli) & Ecoli >= 0)
water_quality %>%
  filter(Ecoli != 1) %>%
  nrow()
## [1] 1
water_quality %>%
  filter(Ecoli != 1) %>%
  nrow() / nrow(water_quality)
## [1] 6.618396e-06
water_quality %>%
  ggplot(aes(x=Ecoli)) +
  geom_histogram()

Variant selection

coordinates <- st_coordinates(water_quality)
water_quality <- water_quality %>%
  mutate(longitude = coordinates[,1]) %>%
  mutate(latitude = coordinates[,2])
water_quality <- water_quality %>%
  select(Sample.Number, year_month, year, month, Residual_Chlorine, Turbidity, longitude, latitude) %>%
  st_drop_geometry()
write.csv(water_quality, "../../data/water_quality.csv", row.names = FALSE)

Sample plot

time <- "2024-05"
variant <- "Residual_Chlorine"

Geometric mapping

We can change the time scale to see different patterns.

plot_data <- water_quality %>%
  filter(as.character(year_month) == time) %>%
  select(all_of(c(variant, "geometry"))) %>%
  group_by(geometry) %>%
  summarise(mean_value = mean(.data[[variant]], na.rm = TRUE))

pal <- colorNumeric(palette = "YlOrRd", domain = plot_data[["mean_value"]])

leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data = neighborhoods, color = "#444", weight = 1, label = ~NTAName) %>%
  addCircleMarkers(
    data = plot_data,
    radius = 5,
    color = pal(plot_data[["mean_value"]]),
    fillOpacity = 0.8,
    stroke = FALSE,
    popup = ~paste(variant, ": ", mean_value)
  ) %>%
  addLegend(
    pal = pal,
    values = plot_data[["mean_value"]],
    title = variant,
    position = "bottomright"
  )

Trend along time scale

time_start <- ym("2023-01")
time_end <- ym("2024-12")
variant <- "Residual_Chlorine"
neighbourhood <- "Harlem (South)"
sub_neighbourhood <- neighborhoods %>%
  filter(NTAName == neighbourhood)

plot_data <- water_quality %>%
  filter(ym(year_month) >= time_start & ym(year_month) <= time_end) %>%
  select(all_of(c(variant, "geometry", "year_month"))) %>%
  group_by(geometry, year_month) %>%
  summarise(mean_value = mean(.data[[variant]], na.rm = TRUE)) %>%
  ungroup()

if (neighbourhood != "Whole NYC")
  plot_data <- st_filter(plot_data, sub_neighbourhood)

plot_data %>%
  ggplot(aes(x=year_month, y=mean_value, group=1)) +
  geom_smooth(method = "loess", se=TRUE) +
  labs(
        title = paste(variant, "change along time in", neighbourhood),
        x = "Time",
        y = variant
    ) +
    theme(
        plot.title = element_text(face = "bold", size = 13),
        axis.text.y = element_text(face = "bold"),
        axis.text.x = element_text(face = "bold", angle = 45, hjust = 1),
        axis.title.y = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        panel.grid = element_blank(),
        panel.background = element_blank(),
        plot.background = element_blank(),
        axis.line = element_line(color = "black")
    )